Introduction

The purpose of this package is to aid in interpretation of rich and complex enrichment or GSEA results. The first step is to cluster gene sets based on the overlap coefficient, which is a metric of similarity that does not punish differences in gene set size. This allows child sets in hierarchical references like GO to cluster with their parents. The package then allows various visualizations of clusters to help summarize biological functions that might be contained within a long list of significantly enriched gene sets.

Installation

The package can be installed from GitHub using:

install.packages("devtools")
devtools::install_github("BIGslu/madRich")

Usage Example

library(madRich)

Sample Enrichment

This is a hypergeometric enrichment calculated with the SEARchways package. It uses a gene list from that package, and performs an enrichment against the C2: Canonical Pathways and C5: GO Biological Processes from the Broad Institute’s Molecular Signatures Database (MSigDB).

gene_list2 <- list(HRV1 = names(SEARchways::example.gene.list[[1]]),
                   HRV2 = names(SEARchways::example.gene.list[[2]]))
df1 <- SEARchways::BIGprofiler(gene_list=gene_list2, 
                               collection="C2", subcollection ="CP", ID="ENSEMBL")
#> [1] "HRV1"
#> 
#> [1] "HRV2"
df2 <- SEARchways::BIGprofiler(gene_list=gene_list2, 
                               collection="C5", subcollection="GO:BP", ID="ENSEMBL")
#> [1] "HRV1"
#> [1] "HRV2"
enrichment <- dplyr::bind_rows(df1, df2)

This results looks like this:

head(enrichment)
#> # A tibble: 6 × 14
#>   group n_query_genes gs_collection gs_subcollection n_background_genes pathway 
#>   <chr>         <int> <chr>         <chr>                         <dbl> <chr>   
#> 1 HRV1            100 C2            CP:KEGG_MEDICUS               15461 KEGG_ME…
#> 2 HRV1            100 C2            CP:REACTOME                   15461 REACTOM…
#> 3 HRV1            100 C2            CP:WIKIPATHWAYS               15461 WP_MITO…
#> 4 HRV1            100 C2            CP:WIKIPATHWAYS               15461 WP_NICO…
#> 5 HRV1            100 C2            CP:REACTOME                   15461 REACTOM…
#> 6 HRV1            100 C2            CP:REACTOME                   15461 REACTOM…
#> # ℹ 8 more variables: n_pathway_genes <dbl>, n_query_genes_in_pathway <dbl>,
#> #   `k/K` <dbl>, pval <dbl>, FDR <dbl>, qvalue <dbl>, genes <list>,
#> #   pathway_GOID <chr>

Whatever your enrichment format, the following steps require a data frame with the following columns: pathway, k/K (for hypergeometric) or NES (for GSEA), FDR (if applying a cutoff), collection & subcollection (if not providing a custom database). If you are enriching against a custom database rather than Broad databases, you must provide this as a data frame with the columns “gs_name” for pathways and “gene” for member genes.

Guiding the Number of Gene Set Clusters

This function creates a list of objects to help guide the choice of tree cut height when finding gene set clusters with clusterSets(). It requires an enrichment or GSEA input. For hypergeometric enrichment, it will provide a single plot and “best” cut height value. For GSEA, it will provide a sign-separated result (one plot and height for positive NES pathways, one for negative).

The function requires either (A) a vector of Broad gene set collections represented in the enrichment result with a named vector of subcollections, or (B), a custom database frame with the columns “gs_name” for pathways and “gene” for member genes.

Also required is a vector of tree cut heights for hclust to calculate values of K and sillhouette scores.

cut_height_test1 <- chooseCutHeight(df = enrichment, 
                                   enrich_method="hypergeometric",
                                   ID = "ENSEMBL",
                                   collections = c("C5", "C2"),
                                   subcollections = c("C2" = "CP", "C5" = "GO:BP"),
                                   hclust_heights = c(0.3,0.5,0.7,0.9),
                                   group_name = "HRV1",
                                   fdr_cutoff = 0.4,
                                   kK_cutoff = 0.03)

The first is the formatted database with only the pathways passing any applied strength/significance cutoffs

head(cut_height_test1$database_format)
#> # A tibble: 6 × 4
#>   sign     pathway             gene            gs_description                   
#>   <chr>    <chr>               <chr>           <chr>                            
#> 1 pathways GOBP_AXON_EXTENSION ENSG00000097007 Long distance growth of a single…
#> 2 pathways GOBP_AXON_EXTENSION ENSG00000143199 Long distance growth of a single…
#> 3 pathways GOBP_AXON_EXTENSION ENSG00000101126 Long distance growth of a single…
#> 4 pathways GOBP_AXON_EXTENSION ENSG00000170017 Long distance growth of a single…
#> 5 pathways GOBP_AXON_EXTENSION ENSG00000176248 Long distance growth of a single…
#> 6 pathways GOBP_AXON_EXTENSION ENSG00000130203 Long distance growth of a single…

The second is a summary of k at the provided cut heights (sign separated in the case of GSEA input).

cut_height_test1$summary_df
#>   cut_height k_at_height avg_silhouette_width sign
#> 1        0.3         134            0.3089300   NA
#> 2        0.5         114            0.3108784   NA
#> 3        0.7          94            0.3098215   NA
#> 4        0.9          46            0.3130909   NA

The third is a plot showing average silhouette score by number of clusters.

cut_height_test1$silhouette_score_plot

And the last is a “best” cut height selected from the input vector of possible cut heights that maximized silhouette width.

cut_height_test1$best_height
#> [1] 0.9

It may benefit to run this function through several times, narrowing in on a cut height that feels best. In our first example, 0.9 was the “best” height. According to our plot, across all possible values of k the average silhouette width is optimized at k = 122 (height between 0.3 and 0.5). BUT, there is also a plateau that matters here. There is not a ton if difference in silhouette score between k ~ 40 and k ~ 140. But reducing our gene set list from ~ 200 sets to ~ 140 clusters isn’t super helpful. Dropping from ~ 200 sets to ~ 20 clusters WOULD be. So let’s try several values above 0.9 to see if we can get a lower and more useful value for k.

cut_height_test2 <- chooseCutHeight(df = enrichment, 
                                   enrich_method="hypergeometric",
                                   ID = "ENSEMBL",
                                   collections = c("C5", "C2"),
                                   subcollections = c("C2" = "CP", "C5" = "GO:BP"),
                                   hclust_heights = c(0.95, 0.96, 0.97),
                                   group_name = "HRV1",
                                   fdr_cutoff = 0.4,
                                   kK_cutoff = 0.03)
cut_height_test2$best_height
#> [1] 0.96

A cut height of 0.96 gets us k = 30, which is pretty close to a local maximum on the plot. So I am going to go with that. You may have reasons to choose a value that is NOT the overall (or even a local) maximum silhouette score value. You may find that you need more clusters to break down a complicated enrichment. Or the “optimal” value may provide too FEW clusters to be useful for your analysis. You can see that there are several local maxima along the curve of values of k. These are relatively better clustering solutions than their neighbors, so try to choose a local maximum even if deviating from the absolute maximum. And keep in mind the possible diminishing returns of adding more clusters when your plot has a large plateau like this one.

Clustering Gene Sets

This function will generate the final clusters. Its inputs are very similar to the previous, and the output will provide the data for all the visualizations in this package.

cluster_solution <- clusterSets(df = enrichment, 
                                enrich_method="hypergeometric",
                                ID = "ENSEMBL",
                                collections = c("C5", "C2"),
                                subcollections = c("C2" = "CP", "C5" = "GO:BP"),
                                hclust_height = cut_height_test2$best_height,
                                group_name = "HRV1",
                                fdr_cutoff = 0.4,
                                kK_cutoff = 0.03)

Plotting Clusters

Dendrogram

The dendrogram shows the tree generated in hclust, with node colors and numbers showing cluster assignment of each gene set.

clusterDendro(cluster_solution)

Scatterplot

The scatterplot shows a dimension reduction (tSNE, UMAP, or PCoA) of clustered sets. The size of circles in the scatterplot can indicate significance (-log10(pval)) or gene set size. I think tSNE does the best job of visually dispersing the points, but your ability to use tSNE or UMAP may depend on the underlying variability in your particular clustering solution.

clusterScatter(cluster_solution, dimred = "tSNE", scores = "pvalue")

Treemap

The treemap plot shows gene set clusters together and sizes boxes based on either gene set size or -log10(pval). It is a good way to visualize the clusters, their member sets, and their relative significance. In this case, I can see that my cluster 1 is a large and possibly informative cluster. So I will focus on that one for the next visualization.

clusterTreemap(cluster_solution, scores = "pvalue")

Wordclouds

These plots take text from gene set names and descriptions and generate word clouds.

wordclouds <- clusterWordclouds(cluster_solution)
#> Joining with `by = join_by(pathway)`

So wordclouds$wordcloud_names[[“1]] would have word clouds made from the gene set NAMES of sets in cluster 1. The function removes common words like articles, but additional words can be removed with the “rmwords” parameter in clusterWordclouds.

wordclouds$wordcloud_names[["1"]]

This gives us some insight into what biological function might be represented by this subset of our enrichment result. There is also a “wordcloud_descriptions” list which creates wordclouds based on the gs_description column in the Broad databases. And “wordclouds_both” list which uses both set names and their descriptions. Note that these word clouds are only useful for a custom database if you have well-curated gs_name and gs_description columns in the input.

wordclouds$wordcloud_names[["18"]]

wordclouds$wordcloud_descriptions[["18"]]

wordclouds$wordcloud_both[["18"]]

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.2
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Los_Angeles
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] madRich_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3      ggdendro_0.2.0          rstudioapi_0.17.1      
#>   [4] jsonlite_2.0.0          magrittr_2.0.4          ggtangle_0.0.9         
#>   [7] farver_2.1.2            rmarkdown_2.30          fs_1.6.6               
#>  [10] vctrs_0.6.5             memoise_2.0.1           ggtree_3.16.3          
#>  [13] rstatix_0.7.3           htmltools_0.5.9         curl_7.0.0             
#>  [16] broom_1.0.11            Formula_1.2-5           gridGraphics_0.5-1     
#>  [19] sass_0.4.10             bslib_0.9.0             plyr_1.8.9             
#>  [22] cachem_1.1.0            ggfittext_0.10.3        commonmark_2.0.0       
#>  [25] igraph_2.2.1            lifecycle_1.0.4         iterators_1.0.14       
#>  [28] pkgconfig_2.0.3         Matrix_1.7-4            R6_2.6.1               
#>  [31] fastmap_1.2.0           gson_0.1.0              GenomeInfoDbData_1.2.14
#>  [34] digest_0.6.39           aplot_0.2.9             enrichplot_1.28.4      
#>  [37] colorspace_2.1-2        patchwork_1.3.2         AnnotationDbi_1.70.0   
#>  [40] S4Vectors_0.46.0        textshaping_1.0.4       RSQLite_2.4.5          
#>  [43] ggpubr_0.6.2            labeling_0.4.3          httr_1.4.7             
#>  [46] abind_1.4-8             compiler_4.5.2          bit64_4.6.0-1          
#>  [49] withr_3.0.2             S7_0.2.1                backports_1.5.0        
#>  [52] BiocParallel_1.42.2     carData_3.0-5           DBI_1.2.3              
#>  [55] R.utils_2.13.0          ggsignif_0.6.4          MASS_7.3-65            
#>  [58] rappdirs_0.3.3          tools_4.5.2             otel_0.2.0             
#>  [61] ape_5.8-1               msigdbr_25.1.1          R.oo_1.27.1            
#>  [64] glue_1.8.0              treemapify_2.6.0        nlme_3.1-168           
#>  [67] GOSemSim_2.34.0         gridtext_0.1.5          grid_4.5.2             
#>  [70] Rtsne_0.17              cluster_2.1.8.1         reshape2_1.4.5         
#>  [73] fgsea_1.34.2            generics_0.1.4          gtable_0.3.6           
#>  [76] R.methodsS3_1.8.2       tidyr_1.3.2             data.table_1.17.8      
#>  [79] SEARchways_1.5.1        xml2_1.5.1              car_3.1-3              
#>  [82] utf8_1.2.6              XVector_0.48.0          BiocGenerics_0.54.1    
#>  [85] markdown_2.0            ggrepel_0.9.6           foreach_1.5.2          
#>  [88] pillar_1.11.1           stringr_1.6.0           yulab.utils_0.2.2      
#>  [91] babelgene_22.9          splines_4.5.2           dplyr_1.1.4            
#>  [94] treeio_1.32.0           lattice_0.22-7          bit_4.6.0              
#>  [97] tidyselect_1.2.1        GO.db_3.21.0            Biostrings_2.76.0      
#> [100] knitr_1.51              litedown_0.9            IRanges_2.42.0         
#> [103] svglite_2.2.2           stats4_4.5.2            xfun_0.55.3            
#> [106] Biobase_2.68.0          factoextra_1.0.7        stringi_1.8.7          
#> [109] UCSC.utils_1.4.0        lazyeval_0.2.2          ggfun_0.2.0            
#> [112] yaml_2.3.12             evaluate_1.0.5          codetools_0.2-20       
#> [115] ggwordcloud_0.6.2       tibble_3.3.0            qvalue_2.40.0          
#> [118] ggplotify_0.1.3         cli_3.6.5               systemfonts_1.3.1      
#> [121] jquerylib_0.1.4         Rcpp_1.1.0              GenomeInfoDb_1.44.3    
#> [124] png_0.1-8               parallel_4.5.2          ggplot2_4.0.1          
#> [127] assertthat_0.2.1        blob_1.2.4              clusterProfiler_4.16.0 
#> [130] DOSE_4.2.0              tidytree_0.4.6          scales_1.4.0           
#> [133] purrr_1.2.0             crayon_1.5.3            scico_1.5.0            
#> [136] rlang_1.1.6             cowplot_1.2.0           fastmatch_1.1-6        
#> [139] KEGGREST_1.48.1